Abstract

Background: β-turns are a secondary structure type that plays an essential role in molecular recognition, protein folding, and stability. They are the most common type of non-repetitive structure, since 25% of the amino acids in protein structures are located in them. Their prediction is considered one of the crucial problems in bioinformatics and molecular biology, and it can provide valuable insights and inputs for fold recognition and drug design.

Results: We propose a hybrid prediction method, which we call H-SVM-LR, that combines support vector machines (SVMs) and logistic regression (LR) to predict β-turns in proteins. Fractional polynomials are used for LR modeling. We utilize position specific scoring matrices (PSSMs) and predicted secondary structure (PSS) as features. Our simulation studies show that H-SVM-LR achieves a Qtotal of 82.87%, 82.84%, and 82.32% on the BT426, BT547, and BT823 datasets respectively. These values are the highest among β-turn prediction methods that are based on PSSMs and secondary structure information. H-SVM-LR also achieves favorable performance in predicting β-turns as measured by the Matthews correlation coefficient (MCC) on these datasets. Furthermore, H-SVM-LR shows good performance when shape strings are considered as additional features.

Conclusions: In this paper, we present a comprehensive approach for β-turn prediction. Experiments show that our proposed approach achieves better performance than competing prediction methods.



Background 

The secondary structure of proteins consists of basic elements: α-helices, β-sheets, random coils, and turns. α-helices and β-sheets are regular secondary structure elements, whereas the residues that belong to turns do not form regular secondary structure elements. In a turn, the Cα atoms of two residues are separated by one to five peptide bonds and the distance between these Cα atoms is less than 7 Å. The number of peptide bonds that separate the two end residues determines the specific turn type. In α-turns and β-turns, the two end residues are separated by four and three peptide bonds respectively. In γ-turns, δ-turns, and π-turns, the two end residues are separated by two, one, and five peptide bonds respectively. The most common type of turn in proteins is the β-turn; β-turns represent approximately 25% of the secondary structure of protein sequences. β-turns can reverse the direction of a protein chain, and they are therefore considered orienting structures [1]. They also have significant effects on protein folding, because they bring the regular secondary structure elements together and allow them to interact. β-turns are not only important in protein folding but are also implicated in the biological activities of peptides, as the bioactive structures that interact with other molecules such as receptors, enzymes, and antibodies [2]. They are also important in the design of various peptidomimetics for many diseases [3]. Therefore, the prediction of β-turns is one of the important problems in molecular biology, and it can provide valuable insights and inputs for fold recognition and drug design.

Different methods have been designed for β-turn prediction. They can be divided into statistical methods and machine learning methods. The statistical methods used in β-turn prediction include the Chou-Fasman method [4], Thornton's algorithm [5], GORBTURN [6], the 1-4 & 2-3 correlation model [7], the sequence couple model [8], and the COUDES method [9]. All of these statistical methods use the sequence as input except for COUDES, which is based on propensities and multiple alignments. COUDES also utilizes secondary structure predicted by PSIPRED [10], SSPRO2 [11], and PROF [12]. The machine learning methods include BTPRED [13], BetaTpred2 [14], MOLEBRNN [15], and NetTurnP [1], which are based on artificial neural networks (ANNs); Kim's method, which is based on k-nearest neighbor (KNN) [16]; and methods based on support vector machines (SVMs), which have recently become popular in the field of β-turn prediction. These SVM-based methods include BTSVM [17], Zhang and colleagues' method [18], Zheng and Kurgan's method [2], Hu and Li's method [19], the method of Liu et al. [20], DEBT [21], and the method of Tang et al. [22]. In BTPRED, secondary structure predictions are utilized with a two-layered network architecture. BetaTpred2 enhances the performance of β-turn prediction by using secondary structure prediction and evolutionary information in the form of position specific scoring matrices (PSSMs) as input to the neural networks. MOLEBRNN uses PSSMs as input to a bidirectional Elman-type recurrent neural network. NetTurnP uses evolutionary information and predicted protein sequence features as input to two ANN layers, in which the first layer is trained to predict whether or not an amino acid is located in a β-turn. Kim's method encodes the protein sequence using a window of up to 9 residues as input to a KNN-based method, which is combined with a filter that uses the secondary structure predicted with PSIPRED for the central residue. In BTSVM, position specific frequency matrices (PSFMs) and PSSMs, both calculated with PSI-BLAST [23], are applied to encode the input for an SVM classifier. Zhang and colleagues' method is another SVM method that uses PSSMs over a 7-residue window and the secondary structure of the central residue predicted by PSIPRED as input. In Zheng and Kurgan's method, an SVM is utilized to predict β-turns using window-based information extracted from four predicted secondary structures (PSSs) together with a selected set of PSSMs as input to the SVM. The SVM-based method developed by Hu and Li combines the increment of diversity, a position conservation scoring function, and secondary structure predicted with PSIPRED to compute the inputs for the prediction of β-turns and γ-turns. Liu et al. combine an SVM with PSS information obtained by using E-SSpred, a protein secondary structure prediction method. DEBT predicts β-turns and their types using information from multiple sequence alignments, PSSs, and predicted dihedral angles. Tang et al. considered, as new features, another type of one-dimensional string of symbols called shape strings, which represent the clustered regions of φ, ψ torsion pairs. In [24] we utilized the idea of under-sampling to create several balanced datasets. These balanced sets were used to train several SVM classifiers independently. The SVMs were aggregated using a linear logistic regression model.

In this paper, we propose a new approach called H-SVM-LR (a hybrid approach of SVMs and logistic regression (LR)) for predicting β-turns. Our proposed approach incorporates the idea of clustering by partitioning the non-β-turn class into three subsets using the k-means clustering algorithm. Each subset is merged with the positive class (β-turn) to form a sub-training set. These sub-training sets are used to train localized SVM classifiers independently. An LR model, built using fractional polynomials, is used to aggregate the localized SVMs to make a collective decision. The merit of using LR to aggregate the localized SVMs is that it enables us to take advantage of statistical modeling theory to find the optimal weight for each local SVM [24]. LR also has the advantage of being widely studied [25], and in recent years many algorithms have been designed to improve its performance. These algorithms include the iteratively re-weighted least squares (IRLS) algorithm, which is a special case of Fisher's scoring method [26,27].

Methods 

Support vector machine (SVM) 

The SVM is a state-of-the-art supervised learning model with an associated learning algorithm for analyzing and classifying data. It maps the data from a low-dimensional space to a high- or infinite-dimensional space and then constructs one or more hyper-planes in this higher-dimensional space to classify the transformed data. Normally the training data are represented as points in a vector space. The hyper-plane with the largest distance to the nearest training data point is considered the best separator.

Given a training set $\{(x_i, y_i)\}_{i=1}^{l}$, where $x_i$ is a vector of features and $y_i \in \{-1, 1\}$, the SVM solves the following primal problem:

$$\min_{w,\, b,\, \xi} \; \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{l} \xi_i \qquad (1)$$

subject to

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l,$$

where $w$ is the normal vector to the hyper-plane, $b$ is the offset from the origin, $\xi_i$ are the slack variables, and $C$ is the error penalty parameter. A kernel function, which maps the input space into a higher-dimensional space, can be applied to create an SVM classifier for non-linear problems. The kernel functions that can be used with SVMs include the polynomial kernel function, the radial basis kernel function (also known as the Gaussian kernel function), and the sigmoid kernel function.
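
As a small illustration of the quantities in Equation (1), the following sketch evaluates the primal objective of a linear SVM and a Gaussian kernel value in plain Python. The function names, the toy data, and the choice to set each slack variable to its smallest feasible value are assumptions made for illustration; this is not the LIBSVM implementation used in the paper.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian (radial basis) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def primal_objective(w, b, X, y, C):
    """Value of (1/2)||w||^2 + C * sum(xi_i) for a linear SVM.

    Each slack xi_i is set to its smallest feasible value,
    max(0, 1 - y_i * (w . x_i + b)).
    """
    margin_term = 0.5 * sum(wj * wj for wj in w)
    total_slack = 0.0
    for x_i, y_i in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, x_i)) + b
        total_slack += max(0.0, 1.0 - y_i * score)
    return margin_term + C * total_slack

# Hypothetical toy data: the third point lies inside the margin.
X = [[2.0, 0.0], [0.0, 0.0], [1.2, 0.0]]
y = [1, -1, 1]
objective = primal_objective([1.0, 0.0], -1.0, X, y, C=1.0)
print(round(objective, 6))  # 0.5 from the margin term plus 0.8 of slack
```

A larger C makes the slack term dominate, so margin violations are penalized more heavily.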

Logistic regression (LR) 

LR is a type of regression analysis used for predicting the outcome of a variable that can take on a limited number of classes. A detailed description of logistic regression can be found in [25]. In brief, given input vectors $x_i \in \mathbb{R}^{n}$ and output values $y_i \in \{0, 1\}$, logistic regression can be fitted using the following likelihood to predict the probability of the output. This probability is $p_i$ if $y_i = 1$, or $1 - p_i$ if $y_i = 0$.

$$L(\theta) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i} \qquad (2)$$

However, it is mathematically easier to work with the logarithm of this equation. The log-likelihood, in which the logarithm turns products into sums, is defined as follows:

$$\ln L(\theta) = \sum_{i=1}^{n} \left( y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right) \qquad (3)$$

The value of $\theta$ that maximizes $L(\theta)$ is called the maximum likelihood estimate and is denoted $\hat{\theta}$. For binary outputs, the loss function, or deviance (DEV), is based on the negative log-likelihood and is given by the following formula:

$$\mathrm{DEV} = -2 \ln L(\theta) \qquad (4)$$

Minimizing the deviance given in the above equation is equivalent to maximizing the log-likelihood.
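
The relationship between the log-likelihood and the deviance can be checked numerically. The sketch below is a minimal plain-Python illustration; the function names and the example labels and probabilities are hypothetical.

```python
import math

def log_likelihood(y, p):
    """Sum over samples of y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

def deviance(y, p):
    """DEV = -2 * ln L, so a lower deviance means a higher likelihood."""
    return -2.0 * log_likelihood(y, p)

y = [1, 0, 1]                 # hypothetical binary outcomes
p_uninformed = [0.5, 0.5, 0.5]
p_better = [0.9, 0.2, 0.8]    # probabilities closer to the outcomes

dev_uninformed = deviance(y, p_uninformed)
dev_better = deviance(y, p_better)
print(dev_uninformed > dev_better)  # the better fit has the lower deviance
```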

Datasets 

The BT426 dataset, which contains 426 non-homologous protein chains, is used to evaluate our H-SVM-LR prediction method. This dataset was developed by Guruprasad and Rajkumar [28]. We obtained it from the Raghava Group's website, http://www.imtech.res.in/raghava/bteval/dataset.html. The structures of the protein chains in BT426 were determined by X-ray crystallography at 2 Å resolution or better. Each chain contains at least one β-turn, and 24.9% of all amino acids in BT426 are assigned to β-turns. Several recent β-turn prediction methods use it as a gold-standard set of amino acid sequences to evaluate their performance. We therefore used it to evaluate our method and to make direct comparisons with other prediction methods. Besides BT426, we used a dataset of 547 protein sequences (BT547) and a dataset of 823 protein sequences (BT823) to evaluate our approach. These datasets were constructed for training and testing COUDES [9].

Features 
PSSMs 

It has been shown that PSSMs contribute significantly to the accuracy of β-turn prediction [1,2]. A PSSM
is an M × 20 matrix, where M is the sequence length. The PSSMs are generated using three rounds of
the iterative PSI-BLAST program [23] against the National
Center for Biotechnology Information (NCBI) non-redundant (nr) sequence database with the default parameters. The PSSM values are scaled to values between 0 and 1 using the following function:

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (5)$$

where x is the PSSM element that stands for the likelihood of the particular residue substitution at that position.

Predicted secondary structure (PSS) 

PROTEUS [29] is used to predict the secondary structure features. The motivation for using PROTEUS comes from the work of Tang et al. [22], which concludes that predictions using PROTEUS and PSSMs were better than predictions using PHD [30], JPRED [31], PROTEUS, and PSSMs together. The secondary structure features are predicted as three structure states: helix (H), strand (E), and coil (C). These three structure states are encoded as 1 0 0 for helix, 0 1 0 for strand, and 0 0 1 for coil.

Predicted shape strings

Tang et al. [22] predicted shape strings with a predictor based on a structural alignment approach. Shape strings were represented by eight states, i.e. S, R, U, V, K, A, T and G. They used a sliding window of 8 amino acids on the PSSM, PSS, and shape string features. We also added shape strings to our PSSM and PSS features. The shape strings were predicted using the protein shape string and its profile prediction server (DSP) [32]. Besides the eight states, DSP defines a shape N for residues where the φ and ψ angles are undefined or where no structure was determined for that part of the sequence. The shape string features are encoded as (1 0 0 0 0 0 0 0 0) for S, (0 1 0 0 0 0 0 0 0) for R, and so on up to (0 0 0 0 0 0 0 0 1) for N.
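
Both encodings are one-hot vectors over a fixed alphabet, which can be sketched as follows. The helper name is hypothetical, and the positions of the shape states between R and N are an assumption that follows the order in which the states are listed in the text.

```python
PSS_STATES = "HEC"          # helix, strand, coil
SHAPE_STATES = "SRUVKATGN"  # eight shape states plus N (order assumed from the text)

def one_hot(symbol, alphabet):
    """Return a 0/1 vector with a single 1 at the index of `symbol`."""
    vector = [0] * len(alphabet)
    vector[alphabet.index(symbol)] = 1
    return vector

print(one_hot("E", PSS_STATES))    # strand
print(one_hot("N", SHAPE_STATES))  # undefined shape
```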

The proposed approach 

The entire framework of our proposed approach is shown in Figure 1. Three SVM classifiers are constructed using inputs from three clustered models. These three SVM classifiers are then integrated with a logistic regression model. Statistical model selection based on fractional polynomials is used to take advantage of each classifier so that the final global classifier achieves better performance.

Elbashir et al. Proteome Science 2013, 11(Suppl 1):S5
http://www.proteomesci.com/content/11/S1/S5

[Figure 1: flow diagram. (a) Sequence, then PSSMs (7x20) and predicted secondary structure (7x3), then feature vector, then SVM model1, SVM model2, and SVM model3, then the signed distances, then LR model, then prediction. (b) The same pipeline with predicted shape string (7x9) added to the features.]

Figure 1 The architecture of the proposed prediction method. Figure 1(a) represents the prediction using PSSMs and PSS, while Figure 1(b) represents the prediction using PSSMs, PSS, and shape strings. 7 denotes the window size, the PSSMs have 20 columns, and there are 3 secondary structure states and 9 shape string states.

A sliding window of seven residues is used over the matrix that consists of the features. The prediction is made for the central residue. This window size was selected in accordance with Shepherd et al. [13], who found that the optimal prediction for β-turns is achieved using a window size of seven or nine.
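
The windowing step can be sketched as below: the per-residue feature rows inside the window are concatenated into one vector for the central residue. The function name is hypothetical, and padding positions beyond the chain ends with zeros is an assumption, since the text does not state how terminal residues are handled.

```python
def window_features(rows, center, window=7):
    """Concatenate per-residue feature rows in a window centred on `center`.

    Positions that fall outside the chain are zero-padded (an assumption).
    """
    half = window // 2
    width = len(rows[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(rows):
            features.extend(rows[pos])
        else:
            features.extend([0.0] * width)
    return features

# Hypothetical chain of 10 residues, each with 20 PSSM values and 3 PSS bits.
rows = [[0.5] * 20 + [0, 0, 1] for _ in range(10)]
middle = window_features(rows, center=5)
first = window_features(rows, center=0)
print(len(middle))  # 7 * (20 + 3) = 161 features per residue
```

Adding the nine shape string states per residue would give 7 * 32 = 224 features, matching the 7x20, 7x3, and 7x9 blocks in Figure 1(b).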

Clustered model 

Since β-turns account for approximately 25% of globular protein residues, the ratio of β-turns to non-β-turns is 1:3. Thus, the training sets used for β-turn prediction are imbalanced. In our trial experiments, we found that if the non-β-turn set is divided into three subsets by a suitable clustering algorithm, each non-β-turn subset together with the whole β-turn set forms an approximately balanced training set. Such a balanced training set is more likely to be separable in the feature space, because the distribution of the non-β-turn samples within a subset is centralized and compact. In other words, the β-turn set can be separated more easily from each non-β-turn cluster by a different hyper-plane. Good performance can therefore be expected from localized SVMs that are each constructed from one non-β-turn cluster against the β-turns. However, each of these SVMs alone is not a good global classifier. This suggests that a better classifier than the SVM trained on the whole data can be constructed by combining these SVMs effectively. In particular, a localized SVM classifier can be constructed for each sub-training set, so that the localized SVMs are not affected by the heterogeneity of the whole training set. To outperform the SVM trained on the whole data, we need to combine these localized SVMs into a global classifier without neglecting their local advantages. Majority voting is one method for combining several classifiers, but its main problem is that it does not assign a weight to each classifier. An LR model can integrate the localized SVM classifiers, and it lets us take advantage of statistical modeling theory to find the optimal weight for each local classifier. The motivation for this clustered model comes from the work of Yi Chang [33], who used localized linear SVM classifiers for data in the feature space defined by a chosen kernel.

First, all the negative examples are divided into three clusters by the k-means clustering algorithm using the original variables. The distribution of the three clusters is shown in Figure 2. We merged all the positive examples with each cluster to form three sub-training sets. These sub-training sets are used to build three SVM models. The three SVMs are not used directly for prediction; instead, they are used as variable generators. During the training and prediction stages, these models are unchanged and every sample is passed through all three models. The signed distance from each example to the separating hyper-plane of each of the three models is computed and stored in d, which has dimension N × 3, where N is the number of instances. d is then used as the new feature vector for an LR model, which weighs the responses of the three models and calculates the prediction probability.
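
The construction of the three sub-training sets can be sketched in plain Python as follows. This is only an illustration under stated assumptions: a basic Lloyd's k-means with a fixed seed stands in for whatever k-means implementation was actually used, the function names and toy points are hypothetical, and the SVM training and LR aggregation steps are omitted.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the centre closest to `point`."""
    return min(range(len(centers)),
               key=lambda c: squared_distance(point, centers[c]))

def kmeans_labels(points, k, iterations=50, seed=0):
    """Basic Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centers) for p in points]

def build_sub_training_sets(positives, negatives, k=3):
    """Merge all positives with each cluster of negatives."""
    labels = kmeans_labels(negatives, k)
    subsets = []
    for c in range(k):
        cluster = [n for n, lab in zip(negatives, labels) if lab == c]
        features = positives + cluster
        targets = [1] * len(positives) + [-1] * len(cluster)
        subsets.append((features, targets))
    return subsets

# Hypothetical 2-D toy data standing in for the window feature vectors.
positives = [[0.0, 0.0], [0.1, 0.1]]
negatives = [[5.0, 5.0], [5.1, 4.9], [-5.0, 5.0],
             [-4.9, 5.1], [0.0, -6.0], [0.2, -6.1]]
subsets = build_sub_training_sets(positives, negatives)
print([len(targets) for _, targets in subsets])
```

Each sub-training set would then be passed to an SVM trainer, and the three signed distances per sample would form one row of d.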

LR model selection

The components of the LR predictive model are its variables, which should be selected carefully so that the model makes accurate predictions without over-fitting the data. There are two competing goals in model selection: (1) the model should be complex enough to fit the data well, and (2) it should be simple enough to interpret. To select our LR model, we first looked at the correlation between the estimated coefficients. If two variables are highly correlated, we do not need both of them in the model. Uni-variate analysis was used to identify the important variables: LR models with one variable at a time were fitted, and the fits were analyzed. In particular, we looked at the estimated coefficients, their standard errors, and the likelihood ratio test for the significance of the coefficients. We then fitted our LR model using the variables selected in the uni-variate analysis according to the following procedure:

- We verified the importance of each variable in the LR model using Wald statistics.

- We compared the coefficient of each variable with the coefficient from the model containing only that variable.

- Any variable that did not appear to be important was eliminated, and a new model was fitted. The new model was checked to see whether it was significantly different from the old model. If it was, the deleted variable was important.

- The process of deleting, refitting, and verifying was repeated until it appeared that all the important variables were included in the model.




[Figure 2: scatter plot of the negative samples; axis labels PC1 and PC2.]

Figure 2 The distribution of the three clusters. The axes represent the top three principal components (PCs) from a principal component analysis (PCA) of the negative samples (non-β-turns). Red dots denote samples in cluster 1, blue dots denote samples in cluster 2, and green dots denote samples in cluster 3.

- We tried to fit a linear LR model to the data, but the prediction error was found to be very large, so we considered power transformations using fractional polynomials.

- A list of possible interactions between each pair of variables was created. These interaction terms were added one at a time to the model containing all the main effects, and their significance was assessed using the likelihood ratio test. The significant interactions were added to the main effects model, its fit was evaluated using Wald tests and the likelihood ratio test for the interaction terms, and any non-significant interaction was dropped.
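
The likelihood ratio comparisons used in this procedure can be sketched as follows for the case where the two nested models differ by a single parameter. The function name and the example deviances are hypothetical; the p-value uses the identity that a chi-square variable with one degree of freedom exceeds s with probability erfc(sqrt(s / 2)).

```python
import math

def likelihood_ratio_test(deviance_reduced, deviance_full):
    """LR statistic and p-value for nested models differing by one parameter."""
    statistic = deviance_reduced - deviance_full
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value

# Hypothetical deviances of a model without and with one extra term.
statistic, p_value = likelihood_ratio_test(1250.0, 1240.0)
print(statistic, p_value < 0.05)  # a drop of 10 is significant at the 5% level
```

A term whose removal raises the deviance by less than about 3.84 would not be significant at the 5% level under this one-parameter test.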

Fractional polynomials 

The final outcome variable is the β-turn/non-turn response. In our hybrid model, this variable depends on the outputs of the three SVM classifiers through a logistic regression model. The outputs of the three SVM classifiers are represented by the vector $d = (d_1, d_2, d_3)$. The natural starting point, the straight-line model $b_0 + b_1 d_1 + b_2 d_2 + b_3 d_3$, or $b_0 + dB$ in matrix form, where $B$ is the vector of parameters, is first tested for adequacy. To improve the fit, we investigated other models. We looked for non-linearity by fitting a first-order fractional polynomial to the data. The best power transformation $d_i^{p}$ was found, with the power $p$ chosen from the candidates -2, -1, -0.5, 0, 0.5, 1, 2, 3, where $d_i^{0}$ denotes $\log(d_i)$. The set includes the straight line (i.e. no transformation), $p = 1$. The variables $d_i$ contain non-positive values, so we transformed them to values > 0, which enables the use of logarithms and negative powers. Including more powers usually offers only a slight improvement in model fit. In particular, including large negative powers such as -3 has the problem that individual extreme observations influence the fit too much [34]. The first-degree fractional polynomial provided an unsatisfactory fit to our data, so we considered second-degree fractional polynomials. We used the closed test procedure, which first determines the best-fitting second-degree fractional polynomial by choosing the powers $p$ and $q$ from the aforementioned set. In the limiting case $p = q$, the terms for the variable $d_i$ are written in the form $b_1 d_i^{p} + b_2 d_i^{p} \log(d_i)$. The best fit among the combinations of such powers is defined as the one that maximizes the likelihood or, equivalently, minimizes the deviance [35]. The MFP package, a collection of R [36] functions for using fractional polynomials to model the influence of continuous variables on the outcome in regression models, is used in this research to find the best fit among the combinations of the powers p and q. MFP uses a sequential procedure and a closed testing selection procedure for a single continuous variable. Using the BT426 dataset, our final model is selected after two cycles. The results of the model selection are shown in Table 1. The best-fit fractional polynomials for SVM model1, SVM model2, and SVM model3 are those with the lowest deviance.
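
The fractional polynomial terms described above can be written down directly. The sketch below is a plain-Python illustration, not the MFP package itself; the function names are hypothetical, and the input d is assumed to be already shifted to a positive value.

```python
import math
from itertools import combinations_with_replacement

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]  # candidate set from the text

def fp_term(d, p):
    """Single fractional polynomial term: d**p, with p = 0 meaning log(d)."""
    return math.log(d) if p == 0 else d ** p

def fp2_terms(d, p, q):
    """Pair of terms of a second-degree fractional polynomial.

    When p == q the second term is the first one multiplied by log(d).
    """
    first = fp_term(d, p)
    if p == q:
        return first, first * math.log(d)
    return first, fp_term(d, q)

# Every unordered pair (p, q), including repeated powers.
fp2_candidates = list(combinations_with_replacement(POWERS, 2))
print(len(POWERS), len(fp2_candidates))  # 8 first-degree, 36 second-degree models
```

Fitting each candidate and keeping the one with the lowest deviance reproduces the kind of search that Table 1 summarizes.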

Training and testing 

We used the LIBSVM package [37] to train and build the SVM prediction models. The radial basis kernel function was used for all the SVMs to map the data nonlinearly from a low-dimensional space to a higher-dimensional space. The default grid search approach was used to find the optimal values for the LIBSVM parameters C and gamma. The leave-one-out cross-validation test, in which different datasets are used for training and testing, is an accurate test method compared with the independent dataset test and the sub-dataset test [38]. In this test, one protein out of N proteins is removed to serve as the testing set and the remaining N - 1 proteins are combined to form the training set. This process is repeated N times, removing a different protein each time. In β-turn prediction, applying this process exactly is time consuming. Thus, most state-of-the-art β-turn prediction methods use seven-fold cross validation to assess their prediction performance [39]. Therefore, we also used seven-fold cross validation to assess the performance of our H-SVM-LR method. We first divided the dataset into seven subsets that contain equal numbers of proteins. In each subset the β-turns account for approximately 25% of the protein residues; in other words, each subset contains the naturally occurring proportion of β-turns. We removed one subset to serve as the testing set and merged the other subsets into one training set, which was used to train H-SVM-LR. This process was repeated seven times so that a different subset was used for testing each time. We take the average of the results from the seven testing sets as the final prediction result.

Table 1 Fractional polynomials for the SVM models using the BT426 dataset.

               Cycle 1                      Cycle 2
Variable       Deviance    p      q         Deviance    p      q
SVM model1     256272.1                     256255.1
               256235.6    1                256209.8    1
               256180.1    -0.5             256146.1    -0.5
               256080.4    1      2         256035.3    1      2
SVM model2     257266.9                     257050.1
               256512.8    1                256314.3    1
               256284.1    0                256086.0    0
               256235.6    0.5    1         256035.3    0.5    1
SVM model3     258586.7                     258511.7
               256669.1    1                256247.5    1
               256626.6    0.5              256148.6    0.5
               256512.8    2      3         256035.3    2      2
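
The protein-level seven-fold protocol can be sketched as follows. The function name and the identifiers are hypothetical, and this simple round-robin split does not by itself guarantee the 25% β-turn proportion in every fold, which would need to be checked separately.

```python
def seven_fold_splits(protein_ids, n_folds=7):
    """Yield (train, test) lists of protein IDs, one pair per fold."""
    folds = [protein_ids[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test

# Hypothetical identifiers standing in for the 426 chains of BT426.
protein_ids = ["chain%03d" % i for i in range(426)]
splits = list(seven_fold_splits(protein_ids))
print([len(test) for _, test in splits])
```

Splitting by protein rather than by residue keeps all windows from one chain on the same side of the train/test boundary.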

Performance measures 

The quality of prediction is evaluated using four measures: the prediction accuracy (Qtotal), Qpredicted, Qobserved, and MCC. These are the measures most frequently used to evaluate β-turn prediction methods. They are calculated from four values: (i) true positives (TP), the number of residues that are correctly classified as β-turns; (ii) true negatives (TN), the number of residues that are correctly classified as non-β-turns; (iii) false positives (FP), the number of non-β-turn residues that are incorrectly classified as β-turns; and (iv) false negatives (FN), the number of β-turn residues that are incorrectly classified as non-β-turns.

The prediction accuracy (also known as Qtotal) refers to the percentage of correctly classified residues and is calculated as follows:

$$Q_{total} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (6)$$

Qpredicted (also known as the positive predictive value (PPV) or the probability of correct prediction) refers to the percentage of residues that are correctly predicted as β-turns among all residues predicted as β-turns and is calculated as follows:

$$Q_{predicted} = \frac{TP}{TP + FP} \times 100 \qquad (7)$$

Qobserved (also known as sensitivity or coverage) refers to the percentage of residues that are correctly predicted as β-turns among those observed as β-turns. In other words, it represents the fraction of the total positive samples that are correctly predicted, and it is calculated as follows:

$$Q_{observed} = \frac{TP}{TP + FN} \times 100 \qquad (8)$$

Because the dataset is imbalanced (25% β-turns), Qtotal by itself is a poor measure: one can achieve a Qtotal of 75% (the baseline accuracy) by predicting all residues to be non-β-turns. Therefore, the Matthews correlation coefficient (MCC) [40] is an important, robust, and reliable performance measure. The MCC is obtained using the following formula:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (9)$$



The value of MCC lies between -1 and 1 inclusive. A value close to 1 indicates a strong positive correlation between predictions and observations, a value close to -1 indicates a strong negative correlation, and a value close to 0 indicates no correlation.
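
The four measures can be computed from the confusion counts as in the following sketch. The function name and the counts are hypothetical (a made-up set of 400 residues with 25% β-turns), not results from the paper.

```python
import math

def prediction_measures(tp, tn, fp, fn):
    """Return (Qtotal, Qpredicted, Qobserved, MCC) from confusion counts."""
    q_total = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    q_predicted = 100.0 * tp / (tp + fp)
    q_observed = 100.0 * tp / (tp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return q_total, q_predicted, q_observed, mcc

# Hypothetical counts: 100 observed beta-turn residues out of 400.
q_total, q_predicted, q_observed, mcc = prediction_measures(
    tp=70, tn=260, fp=40, fn=30)
print(q_total, q_observed, round(q_predicted, 2), round(mcc, 3))
```

Note that Qpredicted and MCC are undefined when no residue is predicted as a β-turn, because the corresponding denominators become zero; a caller would need to handle that case explicitly.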

The receiver operating characteristic (ROC) curve is adopted in this paper as a threshold-independent measure. The ROC curve shows the effectiveness of a β-turn prediction method across all decision thresholds. The area under the ROC curve (AUC) is an important index that reflects the prediction reliability. A good classifier has an area close to 1, while a random classifier has an area of 0.5.
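
One way to compute the AUC without drawing the curve is to use its rank interpretation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise definition; the function name and the scores are hypothetical, and the quadratic loop is only suitable for small examples.

```python
def auc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count 1/2)."""
    positives = [s for s, lab in zip(scores, labels) if lab == 1]
    negatives = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical prediction probabilities and true labels.
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(example_auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```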

Results and discussion 

Methods for β-turn prediction organize the PSSM and PSS features in different ways. Some researchers use a sliding window on the PSSMs and then add the PSS, e.g. [18]. Others use a sliding window on both the PSSMs and the PSS, e.g. [20]. Both organizations were tested with our proposed method, and the results for the BT426 dataset are shown in Table 2.

The results show that H-SVM-LR performs considerably better with a sliding window on both PSSMs and PSS than with a sliding window on PSSMs only followed by adding the PSS of the central amino acid. Figure 3 shows the ROC curves for β-turn prediction under both feature organizations. The AUC highlights the effect of using a sliding window on both PSSMs and PSS: its value is 0.89, which is 0.03 higher than with a sliding window on the PSSMs only.

Table 3 shows the comparison between H-SVM-LR and 
other existing /J-turns prediction methods based on seven- 
fold cross validation on the BT426 dataset. H-SVM-LR 
achieves prediction accuracy or Qtotal = 82.87%, Qpre- 
dicted= 64.83%, Qobserved = 70.66%, and MCC = 0.56. 
The Qtotal of H-SVM-LR is the highest among the exist- 
ing methods that use PSSMs and PSS as features; i.e. 
Zheng and Kurgan's method and the method of Liu et al. 
achieved Qtotal of 80.9. The difference in Qtotal between 
H-SVM-LR and these methods is 1.97%. We emphasize 
that this difference is relatively large when considering 

Table 2 Performance comparison between different 
features organization on the BT426 dataset 



Features organization 


Qtotal 


Qpredicted 


Qobserved MCC 


A sliding window on PSSMs 


81.03 


63.98 


57.40 0.48 


only 








A sliding window on both 


82.87 


64.83 


70.66 0.56 


PSSMs and PSS 
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Figure 3 ROC curves for the prediction using a sliding window 
on PSSMs only and sliding window on both PSSMs and PSS 

Blue curve corresponds to the prediction using sliding window on 
both PSSMs and PSS, while the green curve corresponds to the 
prediction using a sliding window on PSSMs only. The dataset used 
for drawing the curves is BT426. 



that the baseline accuracy equals to 75%, which could be 
obtained by merely regarding all residues as non-/3-turns. 
i.e., H-SVM-LR provides 7.87/25 = 31.5% error rate reduc- 
tion, while Zheng and Kurgan's method and the method 
of Liu et al. provide 5.9/25 = 24% error rate reduction, and 



Table 3 Comparison of H-SVM-LR with other β-turns
prediction methods on the BT426 dataset

Prediction method                        Qtotal  Qpredicted  Qobserved  MCC
H-SVM-LR                                 82.87   64.83       70.66      0.56
Zheng and Kurgan [2]                     80.9    62.7        55.6       0.47
Liu et al. [20]                          80.9    63.6        49.2       0.44
Hu and Li [19]                           79.8    55.6        68.9       0.47
DEBT [21]                                79.2    54.8        70.1       0.48
BTSVM [17]                               78.7    56.0        62.0       0.45
NetTurnP [1]                             78.2    54.4        75.6       0.50
MOLEBRNN [15]                            77.9    53.9        66.0       0.45
Zhang et al. (multiple alignment) [18]   77.3    53.1        67.0       0.45
BetaTPred2 [14]                          75.5    49.8        72.3       0.43
Kim [16]                                 75.0    46.5        66.7       0.40
COUDES [9]                               74.8    48.8        69.9       0.42
BTPRED [13]                              74.4    48.3        57.3       0.35

Note: The results of the method of Liu et al. and NetTurnP method are
obtained from their corresponding papers. The results of other β-turns
prediction methods are obtained from [22].



Hu and Li's method provides 4.8/25 = 19% error rate 
reduction. 
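
The error rate reductions quoted above follow from one line of arithmetic, sketched here in Python. The function name `error_rate_reduction` is ours, not the paper's, and the Qtotal values are copied from Table 3; the 75% baseline is the accuracy of labelling every residue as a non-β-turn.

```python
def error_rate_reduction(q_total, baseline=75.0):
    """Fraction of the baseline error removed by a method.

    q_total and baseline are accuracies in percent. The default baseline
    of 75% corresponds to labelling every residue as a non-beta-turn.
    """
    baseline_error = 100.0 - baseline
    return (q_total - baseline) / baseline_error


# Qtotal values on BT426, copied from Table 3.
methods = {
    "H-SVM-LR": 82.87,
    "Zheng and Kurgan": 80.9,
    "Liu et al.": 80.9,
    "Hu and Li": 79.8,
}

for name, q_total in methods.items():
    reduction = 100 * error_rate_reduction(q_total)
    print(f"{name}: {reduction:.1f}% error rate reduction")
```

The printed values are unrounded versions of the figures in the text, so 5.9/25 appears as 23.6% rather than 24%.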

H-SVM-LR shows a high MCC of 0.56 compared to NetTurnP (0.50),
Zheng and Kurgan's method (0.47), and the method of Liu et al.
(0.44). Thus, H-SVM-LR has the highest MCC and Qtotal among the
β-turns prediction methods compared. The MCC value achieved is
noteworthy since MCC accounts for both over-predictions and
under-predictions. The Qobserved of H-SVM-LR is 15.06 percentage
points higher than the Qobserved of Zheng and Kurgan's method,
1.76 points higher than the Qobserved of Hu and Li's method, and
21.46 points higher than the Qobserved of the method of Liu et al.
A higher Qobserved value means that a larger percentage of the
observed β-turns is correctly predicted. At the same time, the
Qpredicted of our method shows that more than 64% of the predicted
β-turns are actual β-turns. We note that the Qpredicted of
H-SVM-LR is 2.13 points higher than the Qpredicted of Zheng and
Kurgan's method, 9.23 points higher than the Qpredicted of Hu and
Li's method, and 1.23 points higher than the Qpredicted of the
method of Liu et al.
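
For readers who want to reproduce these measures, the sketch below computes the four quantities from confusion counts. It assumes the usual definitions from the β-turn prediction literature (Qtotal as overall accuracy, Qpredicted as the fraction of predicted β-turns that are correct, Qobserved as the fraction of observed β-turns that are recovered); the function name and the counts are made up for illustration and are not results from the paper.

```python
import math


def turn_metrics(tp, tn, fp, fn):
    """Return (Qtotal, Qpredicted, Qobserved, MCC) from confusion counts.

    tp: beta-turn residues predicted as beta-turn
    tn: non-beta-turn residues predicted as non-beta-turn
    fp: non-beta-turn residues predicted as beta-turn (over-predictions)
    fn: beta-turn residues predicted as non-beta-turn (under-predictions)
    """
    total = tp + tn + fp + fn
    q_total = 100.0 * (tp + tn) / total
    q_predicted = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    q_observed = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return q_total, q_predicted, q_observed, mcc


# Hypothetical counts for 1000 residues, 25% of which are beta-turns.
q_total, q_predicted, q_observed, mcc = turn_metrics(tp=175, tn=655, fp=95, fn=75)
print(f"Qtotal={q_total:.2f} Qpredicted={q_predicted:.2f} "
      f"Qobserved={q_observed:.2f} MCC={mcc:.2f}")

# Labelling every residue as non-beta-turn gives the 75% baseline with MCC 0.
print(turn_metrics(tp=0, tn=750, fp=0, fn=250))
```

The second call illustrates why MCC is the more informative measure here: the trivial predictor reaches 75% Qtotal but an MCC of zero.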

Besides the BT426 dataset that is used for training and
testing H-SVM-LR, we used two additional datasets, i.e.
BT547 and BT823, to validate its performance.
Results obtained based on seven-fold cross validation on
these datasets are given in Table 4. The results show that
for the BT547 dataset H-SVM-LR obtains Qtotal =
82.84%, Qpredicted = 63.60%, Qobserved = 68.50%, and
MCC = 0.55. The MCC and Qtotal of H-SVM-LR are the
best among the competing methods evaluated
on the BT547 dataset. We note that the Qpredicted of
H-SVM-LR is 0.7 percentage points lower than the Qpredicted of the
method of Liu et al., while the Qobserved of H-SVM-LR



Table 4 Comparison of H-SVM-LR with other β-turns
prediction methods on BT547 and BT823 datasets.

Prediction method      Dataset  Qtotal  Qpredicted  Qobserved  MCC
H-SVM-LR               BT547    82.84   63.60       68.5       0.55
Zheng and Kurgan [2]   BT547    80.5    61.6        54.2       0.45
Liu et al. [20]        BT547    80.6    64.3        44.5       0.44
Hu and Li [19]         BT547    76.6    47.6        70.2       0.43
DEBT [21]              BT547    80.0    55.9        68.7       0.49
COUDES [9]             BT547    74.6    48.7        70.4       0.42
H-SVM-LR               BT823    82.32   64.48       72.72      0.56
Zheng and Kurgan [2]   BT823    80.6    60.8        54.6       0.45
Liu et al. [20]        BT823    80.5    62.3        44.6       0.44
Hu and Li [19]         BT823    76.8    53.0        72.3       0.45
DEBT [21]              BT823    80.9    55.9        66.1       0.48
COUDES [9]             BT823    74.2    47.5        69.6       0.41

Note: The results of the method of Liu et al. are obtained from their
corresponding paper. The results of other β-turns prediction methods are
obtained from [22].
is 24 percentage points higher than the Qobserved of the method of
Liu et al. The increase in the Qobserved value is a trade-off for
the decrease in the Qpredicted value. In spite of this
trade-off, H-SVM-LR shows high overall accuracy. For
the BT823 dataset H-SVM-LR obtains Qtotal = 82.32%,
Qpredicted = 64.48%, Qobserved = 72.72%, and MCC =
0.56. H-SVM-LR also has the highest MCC, Qtotal,
Qpredicted, and Qobserved on the BT823 dataset. The
results also show that H-SVM-LR performs stably
on all three datasets used. Note that we used
the same LR model that is used for testing BT426. These
results indicate that H-SVM-LR can better discriminate
between β-turns and non-β-turns.
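
The seven-fold cross validation mentioned for BT547 and BT823 can be sketched as follows. This is only an illustration: it assumes that folds are formed over whole protein chains so that all residues of a chain stay together, which this section does not state, and the chain identifiers are placeholders.

```python
import random


def seven_fold_splits(chain_ids, n_folds=7, seed=0):
    """Yield (train_ids, test_ids) pairs for n_folds-fold cross validation.

    Splitting is done over whole protein chains (an assumption), so all
    residues of a chain fall in the same fold.
    """
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)
    # Deal the shuffled chains into n_folds nearly equal folds.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, test_ids in enumerate(folds):
        train_ids = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train_ids, test_ids


# Placeholder identifiers; 547 is chosen to match the BT547 dataset name.
chains = [f"chain_{i:03d}" for i in range(547)]
splits = list(seven_fold_splits(chains))
for train_ids, test_ids in splits:
    print(len(train_ids), len(test_ids))
```

Each chain appears in exactly one test fold, so every residue is predicted once by a model that did not see its chain during training.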

Including shape strings features 

The comparisons between H-SVM-LR after including
the shape strings features and the method of Tang et al.
on the BT426, BT547, and BT823 datasets are shown in Table 5.
Figure 4 depicts the ROC curves for β-turns prediction
using H-SVM-LR before and after adding the shape
strings for the BT426 dataset. The AUC value when
including the shape strings is 0.923, while the AUC
value when using PSSMs and PSS only is 0.886.
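
As a reminder of what the AUC values measure, the sketch below computes the area under the ROC curve with the rank-sum formulation: the probability that a randomly chosen β-turn residue receives a higher score than a randomly chosen non-β-turn residue. The function name and the scores are invented for illustration; the paper's own AUC values come from its classifier outputs, which are not reproduced here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.

    labels: 1 for beta-turn residues, 0 for non-beta-turn residues
    scores: classifier outputs, larger meaning "more likely a beta-turn"
    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one residue of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example with made-up scores for eight residues.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.4, 0.1]
print(roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of residues, which is fine for a small example; a sort-based version would be preferable for whole datasets.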

Conclusions 

In this paper, we proposed an approach that combines
SVM and LR to create a hybrid method for β-turns prediction.
We called this hybrid method H-SVM-LR. In
H-SVM-LR, we utilized protein profiles in the form of
PSSMs, together with PSS, as features. We also considered shape
strings as additional features. We divided the non-β-turn
class into three partitions using the k-means clustering
algorithm, and each partition is then combined with the
β-turn class to form approximately balanced sub-training
sets. An SVM classifier is used for each sub-training set.
Using this procedure, the problem of class imbalance
can be overcome, and the SVM computational time can
be reduced. An LR model selected based on fractional
polynomials is used to aggregate the decisions of the SVMs
to come up with the final β-turn or non-β-turn decision.
Using LR to aggregate the decisions of the SVMs
enables us to take advantage of the statistical modeling
theory to find the optimal weights for each SVM. H- 



Table 5 Comparison of H-SVM-LR with the method of
Tang et al. [22].

Prediction method  Dataset  Qtotal  Qpredicted  Qobserved  MCC
H-SVM-LR           BT426    87.37   74.99       75.20      0.67
Tang et al.        BT426    87.2    73.8        75.9       0.66
H-SVM-LR           BT547    88.64   77.79       76.31      0.70
Tang et al.        BT547    87.3    69.8        86.5       0.69
H-SVM-LR           BT823    89.55   79.53       77.73      0.72
Tang et al.        BT823    88.7    72.6        88.1       0.73




Figure 4 ROC curves for the prediction before and after
including the predicted shape strings (horizontal axis: 1 - Specificity).
The blue curve corresponds to the prediction after including the
predicted shape strings, while the green curve corresponds to the
prediction before including the predicted shape strings. The dataset
used for drawing the curves is BT426.



SVM-LR achieved MCC of 0.56, and Qtotal of 82.87% 
on the BT426 dataset when using PSSMs and PSS as
features. The MCC and the Qtotal achieved are significantly
higher than those of the best existing methods that predict
β-turns using PSSMs and PSS. H-SVM-LR also
obtained the highest MCC and Qtotal on the BT547 and
BT823 datasets. Furthermore, H-SVM-LR shows good
performance when including shape strings features.
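
The class-balancing step summarized above (three k-means partitions of the non-β-turn class, each paired with the full β-turn class) can be sketched in plain Python. This is a minimal illustration under our own assumptions, not the authors' implementation: the k-means routine is a basic Lloyd's iteration, the two-dimensional feature vectors are random stand-ins for the PSSM and PSS features, and the SVM training and LR aggregation steps are omitted.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=3, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its members (keep it if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def build_sub_training_sets(turns, non_turns, k=3, seed=0):
    """Pair every non-turn cluster with the full turn class."""
    assignment = kmeans(non_turns, k=k, seed=seed)
    subsets = []
    for c in range(k):
        cluster = [p for p, a in zip(non_turns, assignment) if a == c]
        features = turns + cluster
        labels = [1] * len(turns) + [0] * len(cluster)
        subsets.append((features, labels))
    return subsets


# Made-up 2-D feature vectors with a 1:3 turn to non-turn ratio.
rng = random.Random(1)
turns = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
non_turns = [[rng.gauss(3, 2), rng.gauss(3, 2)] for _ in range(300)]
subsets = build_sub_training_sets(turns, non_turns)
for features, labels in subsets:
    print(len(labels), sum(labels))
```

Because k-means does not force equal cluster sizes, the sub-training sets are only approximately balanced, which matches the wording used above. Each sub-training set would then be handed to its own SVM, and the three SVM outputs combined by the LR model.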